I. ALGORITHM CONFLICT SENSITIVITY


A.  INTRODUCTION 

In  a  companion  report  [1],  the  effect  of  different  memory 
conflict  resolution  protocols  on  delays  of  memory  accesses  was 
studied. These access delays produce a delay in algorithm
execution, or algorithm delay. However, the relationship between
these  two  delays  has  not  been  investigated  in  the  literature,  in 
part  because  the  hardware  cannot  measure  access  delay  in  general, 
and  partly  because  a  memory  access  simulator  is  far  easier  to 
develop  than  the  full  instruction-level  timing  simulator 
necessary  to  measure  algorithm  delay. 

A  priori,  it  may  not  seem  worth  correlating  access  and 
algorithm  delays.  An  architect  may  be  comfortable  with  the 
assumption  that  there  is  a  general  correspondence  between  the  two. 
Indeed,  it  will  be  one  of  the  purposes  of  this  research  to 
determine  a  rule-of-thumb  relationship  by  examining  some  typical 
scientific  application  codes;  the  question  of  whether  the  code 
itself  is  responsible  for  enhancing  access  delays  will  therefore 
be  answered.  Algorithmically,  however,  it  will  be  shown  that 
codes  can  be  designed  to  exploit  local  conflict-free  memory  and 
achieve  virtual  independence  of  main  memory  access  delays.  These 
will  be  termed  conflict-resistant  algorithms.  Their  study  may 
have  short  term  value  in  the  immediate  task  of  developing  library 
codes  for  the  CRAY  family  of  multiprocessors,  and  long  term  value 
in  establishing  an  additional  application  of  cache  and  local 
memory  in  MP  supercomputer  architectures. 

The  experimental  vehicle  for  this  largely  empirical  study  is 
a CRAY X-MP simulator. This instruction-level simulator produces
numerical and timing results; the latter are accurate to within
0.2% for typical codes executing on a uniprocessor X-MP-2 without
interprocessor  conflicts.  The  conflict  mechanism  of  the  X-MP-2  is 
also  simulated  and  has  been  found  to  be  exact  for  large  kernels  of 
read and write instructions only. The conflict protocol of the
X-MP-2 has been adopted in extension to 16 processors (even though
the X-MP-4 is known to have some variations), except that
processors  are  paired  to  achieve  adequate  memory  bandwidth  for 
buffer  fetches.  More  validation  is  given  in  an  appendix  of  [1]. 

B.  MEMORY  ACCESS  VERSUS  ALGORITHM  DELAYS 

1.  Critical  Path 

The  X-MP  operates  by  a  system  of  register  and  functional  unit 
reservations.  Instructions  begin  execution  either  (1)  when 
reservations  expire  on  the  resources  they  require,  or  (2)  when 
elements  of  a  vector  operand  become  available  from  a  functional 
unit ("chaining"). When reads or writes are delayed by conflicts,
the  associated  register  reservations  are  held  and  chains  are 
delayed. 

A  particular  instruction  issue  and/or  execution  may  or  may 
not  influence  total  execution  time.  For  example,  address 
formation  is  often  masked  by  floating  point  computation  in  a 
vectorized  code.  When  such  influence  does  exist,  the  instruction 
is  on  the  critical  path  of  execution. 

Determining  which  instructions  are  on  this  critical  path  is 
difficult  even  with  a  simulator.  For  example,  a  hold  on 
instruction  issue  does  not  guarantee  that  the  instruction  creating 
the hold is on the critical path; both of the instructions may be
off the critical path and their issue times superfluous to total
algorithm timing. The critical path is first and foremost a global
issue.

This critical path can change as access delays lengthen; for
example, a formerly masked read or write may enter the critical
path when its execution is delayed and no longer masked. An
instruction awaiting two reads, such as V3 in the vector sequence
(in CRAY assembly language)


    V0    ,A0,1     (vector read)
    V1    ,A0,1     (vector read)
    V2    S1*V1     (vector multiply)
    V3    V2+FV0    (vector add)

could be delayed by conflicts on either V0 or V1; thus the
algorithm delay is a function of delays on both the V0 and V1
reads, creating a "worst-case" risk situation.

In  contrast,  a  "best-case"  condition  occurs  when  an 
instruction  is  awaiting  availability  of  alternate  identical 
resources.  In  the  X-MP,  the  read  port  (of  two  ports)  is 
chosen  during  execution.  If  both  ports  are  busied  with  delayed 
reads,  the  first  available  one  is  used. 
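The worst-case and best-case situations described above can be sketched as a toy timing model. This is a minimal illustration, not the simulator's logic: the function names and clock values are hypothetical, and chaining and reservation details are ignored.

```python
def worst_case_start(ready_a: int, ready_b: int,
                     delay_a: int = 0, delay_b: int = 0) -> int:
    """Clock at which an instruction that needs BOTH operands can proceed.

    A conflict delay on either read pushes the start time out, so the
    later of the two (delayed) operands governs.
    """
    return max(ready_a + delay_a, ready_b + delay_b)


def best_case_start(port_free_a: int, port_free_b: int,
                    delay_a: int = 0, delay_b: int = 0) -> int:
    """Clock at which a read that can use EITHER of two ports can start.

    The first port to become available is used, so a delay on one port
    can be hidden by the other.
    """
    return min(port_free_a + delay_a, port_free_b + delay_b)


if __name__ == "__main__":
    # Hypothetical clocks: both resources nominally ready at clock 100,
    # one of them delayed by 10 clocks.
    print(worst_case_start(100, 100, delay_b=10))  # delay is exposed
    print(best_case_start(100, 100, delay_b=10))   # delay is hidden
```

In this simplified picture a delay on one of two required operands always reaches the dependent instruction, whereas a delay on one of two interchangeable ports does not.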

2.  A  Sensitivity  Measure 

In  spite  of  the  above  threats  to  a  well-behaved  cause-effect 
relationship  between  memory  access  and  algorithm  delays,  it  is 
nonetheless  possible  to  develop  a  meaningful  sensitivity  measure 
relating  the  two.  Only  vector  access  delays  will  be  considered  in 
the  following  discussion. 

Define

    $T$ = algorithm execution time
    $T_{m_i}$ = time memory is busied during access of the ith vector in the critical path
    $\Delta T_{m_i}$ = change in $T_{m_i}$
    $\Delta T_i$ = change in $T$ due to $\Delta T_{m_i}$
    $\Delta T$ = total change in $T$ due to delays in vectors,

where $T_{m_i}$ is VL+3 without conflicts, and VL is the vector
length. Then if only the ith instruction is delayed

$$\Delta T_i = \Delta T_{m_i} \tag{1}$$

and the algorithm sensitivity to a delay in the ith access is

\begin{align}
S_i &\triangleq \frac{\text{fractional change in } T}{\text{fractional change in } T_{m_i}} \tag{2} \\
    &= \frac{\Delta T_i / T}{\Delta T_{m_i} / T_{m_i}} \tag{3} \\
    &= \frac{T_{m_i}}{T} \tag{4}
\end{align}


Thus,  this  normalized  sensitivity  is  not  dependent  on  the  fraction 
of  time  a  read  or  write  is  in  the  critical  path,  but,  once  in  the 
critical  path,  on  its  total  vector  length. 

If  m  vectors  are  in  the  critical  path,  and  if  the  effects  on 
T  of  each  vector  access  delay  are  independent  of  other  delays  (to 
be  tested  by  simulation),  then 

$$\Delta T = \sum_{i=1}^{m} \Delta T_i = \sum_{i=1}^{m} \Delta T_{m_i} \tag{5}$$


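If all vectors are delayed by a uniform fraction of their lengths
so that $\Delta T_{m_i} / T_{m_i} = D$, a constant, then define the global
sensitivity

\begin{align}
S_u &\triangleq \frac{\text{fractional change in } T}{D} \tag{6} \\
    &= \frac{\Delta T / T}{D} \tag{7} \\
    &= \frac{\sum_{i=1}^{m} T_{m_i}}{T} \tag{8}
\end{align}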


In  the  limiting  case,  then,  if  every  vector  access  has  one  clock 
in  the  critical  path  in  the  conflict-free  case,  the  total 
sensitivity  would  be  proportional  to  the  sum  of  all  the  read/write 
vector lengths.
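The single-access and global sensitivities can be checked numerically. The sketch below assumes the conflict-free access time VL+3 stated above; the total execution time of 1000 clocks and the choice of three critical-path vectors are hypothetical values chosen only for illustration.

```python
def access_time(vl: int) -> int:
    """Conflict-free memory busy time for one vector access (VL + 3 clocks)."""
    return vl + 3


def single_access_sensitivity(t_mi: float, t_total: float) -> float:
    """Sensitivity to a delay in one critical-path access: T_mi / T."""
    return t_mi / t_total


def global_sensitivity(vector_lengths, t_total: float) -> float:
    """Sensitivity to a uniform fractional delay D: (sum of T_mi) / T."""
    return sum(access_time(vl) for vl in vector_lengths) / t_total


if __name__ == "__main__":
    T = 1000.0                 # hypothetical algorithm execution time (clocks)
    lengths = [64, 64, 64]     # hypothetical critical-path vector accesses
    print(single_access_sensitivity(access_time(64), T))
    s_u = global_sensitivity(lengths, T)
    print(s_u)

    # Cross-check against the definition: delay every access by fraction D,
    # add the independent delays, and form (dT / T) / D.
    D = 0.1
    delta_t = sum(D * access_time(vl) for vl in lengths)
    print((delta_t / T) / D)
```

Under the independence assumption the last two printed values agree, which is the content of the step from the definition of the global sensitivity to its closed form.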

II.  SIMULATED  SENSITIVITY  STUDIES 

A.  LARGE-DELAY  SENSITIVITY 

The  relationship  of  Eq.  (9)  merely  represents  the  additive 
nature  of  independent  delays.  The  practical  issues  are  the 
effects  of  (a)  large  and  (b)  random  access  delays  on  the  critical 
path  or,  equivalently,  on  the  algorithm  delay.  These  two  effects 
will  be  measured  separately  by  simulation. 

The  Su  defined  in  Eq.  (6)  can  be  measured,  irrespective  of 
whether  Eq.  (9)  applies  as  a  result  of  independence.  By  disabling 
the X-MP conflict resolution protocol in the simulator and instead
artificially delaying all accesses by a uniform fraction D of their
vector lengths, the delay (Du) in algorithm execution can be
measured  as  a  function  of  D.  This  will  test  the  dependence  of  the 
critical  path  on  large  but  uniform  delays. 
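Extracting Su from such an experiment amounts to fitting the slope of the normalized algorithm delay against D. The following sketch uses an ordinary least-squares slope; the data points are hypothetical placeholders, not measurements from the simulator.

```python
def fit_slope(ds, dus):
    """Least-squares slope of algorithm delay Du versus imposed delay D."""
    n = len(ds)
    if n < 2 or n != len(dus):
        raise ValueError("need at least two paired samples")
    mean_d = sum(ds) / n
    mean_du = sum(dus) / n
    sxx = sum((d - mean_d) ** 2 for d in ds)
    sxy = sum((d - mean_d) * (u - mean_du) for d, u in zip(ds, dus))
    return sxy / sxx


if __name__ == "__main__":
    # Hypothetical sweep: D from 0 to 100% of the vector length.
    imposed = [0.0, 0.25, 0.5, 0.75, 1.0]
    observed = [0.0, 0.1, 0.2, 0.3, 0.4]
    print(fit_slope(imposed, observed))
```

A slope that stays constant over the whole sweep would indicate that the critical path is not being reshaped by the uniform delays.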

The  result  (Figure  1)  shows  that,  for  the  three  test  codes, 
the  slope  Su  remains  nearly  constant  for  delays  of  up  to  100%  of 
the vector length. Thus, under the assumption of uniform delays,
the hazards of critical path disruption are insignificant for
access delays far greater than those likely to be encountered in
practice. The incremental sensitivities Su measured at D = 0 are
given in Table 1 for a large number of cases.

It  should  be  noted  that  Eq.  (9)  has  been  verified  by 
inspection  of  clock-level  timing  of  MUL2  and  CFD  executions. 


Thus,  algorithm  delay  seems  likely  to  be  independent  of  access 
delays  for  even  large  uniform  delays. 

B.  EFFECTS  OF  RANDOMNESS 

With the conflict protocol enabled, the delay in algorithm
execution was measured for all processors involved in a
simulation, and the delays averaged. The delays of all accesses
were also recorded and averaged. These delays were normalized by
dividing by the total algorithm execution time and by VL (=64 for
all codes), respectively; this yielded Dal and Dac. The sensitivity

$$S_{al} = D_{al} / D_{ac}$$

is then the measure of the random, large-deviation sensitivity
encountered in practice.
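As a quick arithmetic check, the ratio can be recomputed from a few rows of Table 1. The sketch assumes that the two delay columns of Table 1 are the access delay and the algorithm delay in percent, in that order; the function name is hypothetical.

```python
# (code, banks) -> (access delay %, algorithm delay %, Sal as printed)
TABLE1_ROWS = {
    ("FFT", 256): (7.7, 3.8, 0.493),
    ("CFD", 256): (6.0, 2.3, 0.383),
    ("MUL2", 128): (25.4, 21.9, 0.862),
}


def random_delay_sensitivity(d_al: float, d_ac: float) -> float:
    """Sal = Dal / Dac, with both delays in the same (normalized) units."""
    if d_ac <= 0:
        raise ValueError("access delay must be positive")
    return d_al / d_ac


if __name__ == "__main__":
    for (code, banks), (d_ac, d_al, printed) in TABLE1_ROWS.items():
        computed = random_delay_sensitivity(d_al, d_ac)
        print(f"{code:5s} {banks:4d}  computed={computed:.3f}  printed={printed:.3f}")
```

Small differences in the last digit are expected, since the printed delays are themselves rounded.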


Table  1  indicates  that  Su  and  Sal  are  sufficiently  different 
that  the  critical  path  must  be  moderately  altered  in  some  codes. 
Since  large  uniform  deviations  have  been  shown  to  have  nominal 
effects  on  sensitivity,  one  is  left  to  conclude  that  it  is  the 
randomness  which  disrupts  the  critical  path.  This  is  consistent 
with  the  previous  discussion  of  how  the  critical  path  is  altered, 
e.g.,  by  masking  and  by  best-case  and  worst-case  events. 


It  appears  that  the  sensitivity  to  a  single  delayed  access 
should  be  less  than  unity;  the  provision  for  late  chaining  avoids 
the  prospect  of  a  delayed  access  causing  a  missed  chain-slot  time, 
as in the CRAY-1. However, it is unclear whether Sal can be
greater  than  unity.  Nonetheless,  Sal  <  1  for  all  measured 
sensitivities  (Table  2),  with  the  largest  being  .939. 


Code   Banks |  Utilizations  | Delays x 100%  |    Sensitivities
             |   Um     Ub    |  Dac     Dal   |  Sal    Su    Sal/Um
-------------+----------------+----------------+-----------------------
FFT     256  |  .653   .965   |   7.7    3.8   |  .493  .417   .755
        128  |  .629   .998   |  16.8    8.1   |  .482  .417   .766
CFD     256  |  .682   .986   |   6.0    2.3   |  .383  .336   .561
        128  |  .668   .999   |  12.3    4.3   |  .349  .336   .522
MUL1    256  |  .690   .832   |   5.1    2.5   |  .490  .464   .710
        128  |  .664   .921   |  16.8    6.5   |  .387  .464   .592
MUL2    256  |  1.51   .998   |   5.9    5.1   |  .864  .602   .572
        128  |  1.31   .998   |  25.4   21.9   |  .862  .602   .658
MUL3    256  |  .932   .966   |   2.6     .4   |  .150  .147   .160
        128  |  .925   .989   |   6.5    1.1   |  .161  .147   .174

Table 1. Summary of simulation results for 16 processors.
Sixteen samples were used to determine averages.


C.  AN  EMPIRICAL  RELATIONSHIP 

Aside from confirming intuition, Table 1 appears to show a
relationship  between  sensitivity  of  Fortran  codes  and  their  memory 
utilization. 

Define

$$U_m \triangleq \frac{\text{average number of memory reads and writes per processor}}{\text{algorithm execution time per processor}}$$

Then the ratio Sal/Um is shown in Table 1 to range over a rather
restricted set of values (.552 to .766), across different codes
and in the presence of access delays Dac which vary over a 5:1
range (.051 to .254) as the number of banks is varied. An
approximate sensitivity determined from

$$S_{al} \approx .65\, U_m$$

would be within 18% for all cases. The range of this
approximation is limited, however, if Sal is bounded by unity.
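The rule-of-thumb lends itself to a small estimator. The sketch below is an illustration only: the function names are hypothetical, the cap at unity is an assumption motivated by the remark that Sal may be bounded by unity, and the product Dal = Dac x Sal is the decomposition of algorithm delay used in this report.

```python
RULE_OF_THUMB_COEFF = 0.65  # empirical coefficient relating Sal to Um


def estimate_sensitivity(u_m: float) -> float:
    """Approximate Sal from memory utilization Um (capped at 1 by assumption)."""
    if u_m < 0:
        raise ValueError("memory utilization cannot be negative")
    return min(1.0, RULE_OF_THUMB_COEFF * u_m)


def estimate_algorithm_delay(d_ac: float, u_m: float) -> float:
    """Approximate Dal as Dac times the estimated Sal (same units as Dac)."""
    return d_ac * estimate_sensitivity(u_m)


if __name__ == "__main__":
    # Inputs taken from the FFT, 256-bank row of Table 1.
    print(estimate_sensitivity(0.653))
    print(estimate_algorithm_delay(7.7, 0.653))
```

The estimate applies to conventional Fortran vector codes; a prefetching assembly-coded kernel such as MUL3 falls well below it.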

The relationship between Sal and Um is felt to be indirect;
possibly it is due to the number of ports, rather than Um, which
supply vector operands to the matrix multiply inner loop. If one
of these ports is delayed, a "worst-case" delay is imposed on the
loop and Sal increases.

D.  CONCLUSIONS 

In  this  section,  two  results  stand  out. 


(a) The randomness rather than the size of the access delays
has the greatest effect on the critical path.

(b) If the memory utilization per processor Um is known, the
algorithm sensitivity to access delays may be estimated
from the rule-of-thumb

$$S_{al} \approx .65\, U_m$$

for conventional Fortran vector codes. This puts the
simpler memory access simulation performed by computer
architects on firm ground, insofar as its ability
to predict algorithm delay is concerned.

The  above  conclusions  are  based  on  three  Fortran  codes;  this 
must  be  acknowledged  as  a  small  sample,  in  spite  of  the  diversity 
of  their  access  patterns.  Also,  the  vector  length  was  constant  at 
64; the above formula could also depend on VL, since for a given
Um, the critical path would likely be more disrupted by short
vectors.


III.  ALGORITHM  DELAY  RESULTS 

Although the above study of the two components of Dal (= Dac
Sal) may give insight, the algorithm delay Dal itself is
ultimately of interest. These delays are depicted in Figures 2
and 3 and given in Table 2.

Figure 2 gives Dal when Rbp = 16, the most likely situation.
The FFT, MUL1, and CFD codes have nearly the same delay, between 2%
and 3%, from 1 to 16 processors, corresponding to their similar Um.
MUL2, with high access delay and large Sal, has nearly a 5% delay.
There are seemingly no surprises here.

When the number of banks is halved, Figure 3 indicates that
Dal of the high access MUL2 code increases by the greatest ratio
(4.1:1). Even the small differences between curves in Figure 2 are
magnified in Figure 3. The implication is that, since Sal remains
relatively constant (Table 2) as Rbp increases, Rbp = 16 is the
smallest ratio which avoids the risk of high Dac's with common Um's.


IV. CONFLICT-RESISTANT ALGORITHMS


A.  INTRODUCTION 

Undoubtedly,  the  greatest  benefit  of  defining  an  algorithm 
sensitivity  is  in  algorithm  design.  It  will  be  shown  possible, 
with  careful  control  of  the  data  flow  in  each  processor  using 
assembly language (CAL), to defeat the normal relationship between
Dac and Dal and ultimately to reduce the small-delay (incremental)
sensitivity  to  zero.  Since  assembly  coding  is  a  common  practice 
for  CRAY-1  and  CRAY  X-MP  library  programs,  it  may  take  a  small 
additional  effort  to  isolate  these  workhorse  codes  from  the  large 
delays  possibly  associated  with  many-processor  architectures. 

B.  LOCAL  MEMORY  UTILIZATION 

Two  conditions  must  be  met  to  guarantee  code  performance 
resistant  to  access  delays. 

(a)  Shared  memory  access  must  be  off  the  critical  path,  and 

(b)  Vector  access  must  be  on  data  in  conflict-free  storage. 
The  vector  register  set  forms  such  storage  on  the  X-MP;  the  former 
can  be  achieved  by  pre-fetching  operands  and  post-storing  results. 

Prefetching  is  difficult  to  achieve  for  general  codes,  and, 
where  possible,  usually  requires  loop  reordering  and  other 
instruction  scheduling  beyond  commercially-available  compilers. 
Library  programs,  which  are  often  built  around  a  small  kernel,  are 
candidates  for  such  coding.  Fortunately,  CRAY-1  experience  has 
shown  that  prefetching  can  be  completely  masked  by  floating  point 
computation  in  linear  algebra  codes  without  reducing  the  execution 
rate;  the  vector  register  set  is  sufficiently  large  to  act  as  a 
non-conflict buffer [2].
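Why prefetching drives the small-delay sensitivity toward zero can be seen in a toy model of one loop iteration. This is a deliberately simplified sketch, not the X-MP timing model: the function name and clock counts are hypothetical, and the non-prefetched case is treated as fully serialized, which ignores chaining.

```python
def iteration_time(compute_clocks: int, fetch_clocks: int,
                   access_delay: int, prefetched: bool) -> int:
    """Clocks for one loop iteration under a toy overlap model.

    With prefetching, the (possibly delayed) fetch of the next operand
    overlaps the floating point work, so only the longer of the two
    counts. Without it, the fetch sits on the critical path.
    """
    fetch = fetch_clocks + access_delay
    if prefetched:
        return max(compute_clocks, fetch)
    return compute_clocks + fetch


if __name__ == "__main__":
    compute, fetch = 200, 64 + 3   # hypothetical compute time; VL + 3 fetch
    for delay in (0, 20, 150):
        masked = iteration_time(compute, fetch, delay, prefetched=True)
        exposed = iteration_time(compute, fetch, delay, prefetched=False)
        print(delay, masked, exposed)
```

In this model the prefetched loop is unaffected until the access delay exceeds the slack between compute and fetch time, which corresponds to a zero incremental sensitivity for small delays.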


A matrix-vector multiply (MUL3) has been assembly coded with
these features; the resultant sensitivities are shown in Table 1.
The  related  and  Sa^'s  are  a  small  fraction  (20-25%)  of  those  for 
the  corresponding  MULj  Fortran  code ,  and  considerably  less  than 
any  other  kernel  in  the  table. 

C.  NON-UNIT  STRIDE  ACCESS 

Sal of the inner loop of MUL3 has a zero value for small
delays. However, the CAL implementation that yielded the low Sal
of MUL3 in Tables 1 and 2 for 64 x 16 matrix-vector multiplies
produced quite different results when 64 x 64 matrices were
multiplied, as indicated in Table 3a. The sensitivities Sal
increase with number of banks, although Dac and Dal decrease
individually. The origin of the problem seems worthy of discussion.

It is a convenience in CAL coding of matrix multiplies and
other linear algebra codes to implement

    Y ← Y + MX

by loading the elements of X in reverse order into a vector
register, and then arranging them as scalars to multiply the
columns of M. The related assembly code has the form
Modified Code

Banks |  Dac (%)   Dal (%)    Sal
------+----------------------------
  16  |   17.6      1.47     .083
  32  |   1.74      .391     .224
      |   1.53      .271     .178

Table 3. Effect of eliminating counter-grain
access in 64 x 64 multiply; p = 4.
Figure 4. Effect of negative-stride access
Table 2. Delay and sensitivity summary for
two Rbp ratios. X-MP-2 protocol.


A delay in the load of V0 may delay the first trip through the
inner loop if VL is sufficiently long, since the load of S1 will
not chain off the read. Worse, the read has a negative unit
stride, whereas all other accesses have positive unit strides.
Figure 4 shows the effects of such an access on the other highly
regular accesses. Access Z3, beginning at clock 5588, intersects
and delays seven other accesses, two of them twice. It is evident
that a window of accesses extending approximately from -VL to +2VL
clocks from the initiation of V0 is potentially affected by such a
counter-grain access. The access Z3 itself is delayed by 28
clocks.
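The size of the affected window follows directly from the vector length. The helper below is a hypothetical illustration of that arithmetic, using the approximate bounds of -VL and +2VL clocks quoted above.

```python
def affected_window(start_clock: int, vl: int) -> tuple:
    """Approximate clock range in which other accesses may be disturbed
    by a counter-grain access that starts at start_clock."""
    return (start_clock - vl, start_clock + 2 * vl)


if __name__ == "__main__":
    # The Z3 access of Figure 4 begins at clock 5588 with VL = 64.
    low, high = affected_window(5588, 64)
    print(low, high, high - low)
```

For VL = 64 the window spans about 3 x VL = 192 clocks, which is long enough to overlap several of the regular unit-stride accesses issued around it.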

When  the  access  is  replaced  by  a  positive  unit-stride  access, 
the  low  sensitivity  of  the  modified  code  of  Table  3  is  obtained. 
The  Dal  is  reduced  to  insignificant  levels  (.272%)  for  a  typical 
Rbp  =  16,  and  retains  these  levels  (1.47%)  when  the  number  of 
banks  is  reduced  by  a  further  factor  of  4! 
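The update Y ← Y + MX discussed in this section can be written as a column-oriented loop in which each element of X acts as a scalar multiplying one column of M. The sketch below is an assumption-laden illustration in Python rather than CAL: it only shows that the order in which the elements of X are consumed does not change the result (exactly so for integers, and up to rounding for floating point), which is what leaves room for a positive-stride load. How the modified code actually arranged the scalars is not shown in the text.

```python
def matvec_accumulate(y, m, x, reverse_x=False):
    """Return y + m @ x using column-at-a-time updates.

    m is a list of rows; x[j] scales column j of m. With reverse_x=True
    the columns are processed from last to first.
    """
    n_rows, n_cols = len(y), len(x)
    order = range(n_cols - 1, -1, -1) if reverse_x else range(n_cols)
    out = list(y)
    for j in order:
        scalar = x[j]                  # element of X used as a scalar
        for i in range(n_rows):        # one "vector" operation per column
            out[i] += scalar * m[i][j]
    return out


if __name__ == "__main__":
    m = [[1, 2], [3, 4]]
    x = [5, 6]
    y = [1, 1]
    print(matvec_accumulate(y, m, x))
    print(matvec_accumulate(y, m, x, reverse_x=True))
```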
APPENDIX A


EXPERIMENT  DESCRIPTION 

EXPERIMENTAL  PARAMETERS 

The codes were produced by the X-MP CFT compiler from Fortran
source codes. Vector length (VL) is 64 and stride is 1 for all
cases.

Distinct program and data storage was used for each of the 16
processors. Code executions were initiated at irregular intervals
to further randomize accesses between processors. In general, p
samples were used to produce mean values with p processors.

Two  global  static  measures  of  memory  accesses  were  made  to 
monitor  their  uniformity. 

(a) Memory utilization. This is the fraction

$$U_m \triangleq \frac{\text{total operands and results}}{\text{simulation time (CP's)}}$$

for the average processor; it is a measure of memory traffic for
each code, and has a maximum value of 3, corresponding to the
number of memory ports per processor. Table 1 shows Um ≈ .67 for
FFT, CFD, and MUL1.
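A direct transcription of this measure is shown below. The function name and the sample counts are hypothetical; only the ratio and the three-port ceiling come from the text.

```python
MAX_PORTS_PER_PROCESSOR = 3  # upper bound on memory utilization


def memory_utilization(operands_and_results: int, simulation_clocks: int) -> float:
    """Words moved to or from memory per clock period for one processor."""
    if simulation_clocks <= 0:
        raise ValueError("simulation time must be positive")
    u_m = operands_and_results / simulation_clocks
    if u_m > MAX_PORTS_PER_PROCESSOR:
        raise ValueError("utilization exceeds the number of memory ports")
    return u_m


if __name__ == "__main__":
    # Hypothetical run: 6700 operands and results in 10000 clock periods.
    print(memory_utilization(6700, 10000))
```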

(b) Bank utilization. Let Nb be the number of banks. There
is a risk with 64-length unit-stride vectors and Nb > 64 that
banks  will  not  be  equally  utilized;  this  would  create 
uncharacteristic  delays  in  heavily-utilized  banks.  If  N 

is  the  average  number  of  accesses  per  processor  across  all  banks, 
and  N  is  the  standard  deviation  from  this  average,  define  the  bank 
utilization 


Ub = 1 indicates uniform accessing; if only 1/2 of the banks are
accessed, Ub = 1/2. Table 1 indicates .832 < Ub < .998.
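The defining formula for the bank utilization is not legible in this copy. Purely as an illustration, and as an assumption rather than the author's definition, one expression built from the mean and standard deviation of per-bank access counts that reproduces both stated properties (a value of 1 for uniform access and 1/2 when only half the banks are used) is 1 / (1 + (sigma/mean)^2):

```python
from statistics import mean, pstdev


def bank_utilization_sketch(access_counts) -> float:
    """ASSUMED form, not taken from the source: 1 / (1 + (sigma / mean)^2).

    access_counts holds the number of accesses observed at each bank.
    """
    avg = mean(access_counts)
    if avg <= 0:
        raise ValueError("no accesses recorded")
    sigma = pstdev(access_counts)   # population standard deviation
    return 1.0 / (1.0 + (sigma / avg) ** 2)


if __name__ == "__main__":
    uniform = [10] * 8               # every bank equally used
    half_used = [20] * 4 + [0] * 4   # only half of the banks accessed
    print(bank_utilization_sketch(uniform))
    print(bank_utilization_sketch(half_used))
```

Other expressions could satisfy the same two conditions, so this form should be read as one plausible candidate only.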

CODE  DESCRIPTIONS 

(a) Fluids kernel (CFD). Taken from the vectorized code of
[7], this is a 32-statement single-loop Fortran kernel with an
average of 3.2 64-length vector-vector operations/statement. Lack
of a repetitive computational structure like that of FFT and MUL
should make the access pattern the most random. Six buffer fetches
occur in one kernel execution.

(b)  FFT  kernel  (FFT).  This  code  determines  multiple  8-point 
complex-complex  FFT's.  Five  buffer  fetches  occur  in  one  kernel 
execution. 

(c) Matrix-vector multiply kernel (MUL1, MUL2, MUL3). The
inner loop of MUL1 and MUL2 has two vector reads and one write per
execution. MUL1 maintains low memory utilization (Um = .69) with
VL = 64 by multiplying 4 small (64 x 3) matrices in one kernel
execution step; MUL2 uses the same code with 512 x 2 matrices, which
successively exercises the inner loop 512/64 = 8 times, and
achieves Um = 1.58, a value more characteristic of a large
Fortran-coded matrix multiply on the X-MP. No buffer fetches
occur in consecutive executions of the kernel. The inner loop of
MUL3 has one pre-fetched vector read per inner loop execution.